I have recently been thinking about how a discipline changes over the years. Watching such changes over the course of a long career must be simultaneously bewildering and exciting. Given a long enough timeframe, it is easy to pick out the landmark papers or ideas which have shifted the field. Noam Chomsky's 1959 review of Skinner's book on verbal behavior, Green and Swets' book on signal detection theory, and Kahneman and Tversky's work on decision making are just a few obvious examples. Identifying these things as they happen, however, is not always quite as easy. It seemed to me that psychology had some pretty groundbreaking stuff happening when I was in grad school, and so I thought I would investigate how interest in certain topics may have waxed or waned in the last few years.

To investigate this idea, I thought I'd start at a very basic level - getting the count of the number of publications which turn up when searching for a given term (e.g. affect). I first tried this with Google Scholar, but Google Scholar lacks a dedicated API, and there is no easy way to scrape the results page. Furthermore, the results Google Scholar returns are too broad, which would have made the data noisier than necessary, and so I looked around for something else.

It turns out that Scopus has a pretty good web interface which minimizes the hassle of retrieving this type of data. Searching for some term will lead you to a page that looks like this:

Looking to the top, we see a very welcome button: Analyze search results. Yes, I'll have one of those, please! Following that takes you to a page which allows you to explore the data interactively, as well as download a CSV. So, that's what I did. I created a file which contained the number of search results within psychology for a number of different search terms. In particular, I obtained results for each of the following terms:

  • Affect
  • Aggresi*
  • Attitude OR Belief
  • Cheat
  • Cultur*
  • Deceit*
  • Deception
  • Dishonest
  • Group
  • Honest OR Honesty
  • Influence OR Persuasion
  • Lie*
  • Lying
  • Moral*
  • Self*
  • Social*

It should be said that I had a pretty specific hypothesis going into this, but I've found some odd stuff in the data. I'm not yet sure where it comes from, so this is just an initial exploration.

First, we import our needed modules.


In [2]:
import pandas as pd
import numpy as np
from ggplot import *

Next, I get the location of the file on disk and read it in. I also rename one of the columns, as the presence of symbols in variable names can sometimes cause trouble. I don't think it would have been a problem here, but I did it anyway.


In [5]:
location = r'/Users/triddle/Documents/riddlet.github.io/_drafts/Travis Psychology Terms.xlsx'

df = pd.read_excel(location, 3) #the good stuff is on sheet 4
df.rename(columns={'# of RESULTS':'Number of Results'}, inplace=True)

Okay, this part is ugly. I'm sure there's a better way here.

The cell below creates a new dataframe called year_props. The first line creates a dataframe in which year and term are grouped together as an index, with each entry holding an associated number of results. The second line takes that dataframe and computes the proportion for each grouping (i.e. the proportion of all hits that a term accounts for in that year).


In [10]:
year_term = df.groupby(['YEAR', 'TERM']).agg({'Number of Results': 'max'})
year_props = year_term.groupby(level=0).apply(lambda x: x/float(x.max()))
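As an aside, one less ugly route (an alternative I'm sketching here, not what the post actually ran) is transform, which broadcasts each year's maximum count - the 'Total #' row - back onto every row, so the proportions come out in one step and the index never gets touched. A minimal sketch with toy rows standing in for the Scopus export:

```python
import pandas as pd

# Toy rows standing in for the Scopus export; column names match the post.
df = pd.DataFrame({
    'YEAR': [2013, 2013, 2012, 2012],
    'TERM': ['Affect', 'Total #', 'Affect', 'Total #'],
    'Number of Results': [100, 1000, 90, 900],
})

# transform('max') returns a Series aligned row-for-row with df, holding
# each year's maximum count, so the division happens element-wise.
df['Proportion'] = (df['Number of Results'] /
                    df.groupby('YEAR')['Number of Results'].transform('max'))

print(df['Proportion'].tolist())  # [0.1, 1.0, 0.1, 1.0]
```

Because transform keeps the original shape, YEAR and TERM stay as ordinary columns and the index gymnastics below become unnecessary.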

Continuing with the ugly, we create empty lists for terms and years and, using a for-loop, grab all the terms and years in the index of the new dataframe we just created. We then use these values to add terms and years as new variables in the year_props dataframe.

So, in other words, the cell above had the unfortunate side effect of turning some of our variables into indices. I'd rather have them as variables (though I'm sure there's a very strong case to be made for keeping them as indices), and so I have to undo what I just did. The following lines do that:


In [11]:
terms = []
years = []
for i in year_props.index.values[:]:
    terms.append(i[1])
    years.append(i[0])

year_props['TERM'] = terms
year_props['YEAR'] = years
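For what it's worth, pandas has a built-in way to do exactly this: reset_index() promotes the index levels back into ordinary columns in one call. A minimal sketch, using a small MultiIndexed frame in place of the real year_props:

```python
import pandas as pd

# A small frame indexed by (YEAR, TERM), shaped like year_props above.
idx = pd.MultiIndex.from_tuples(
    [(2012, 'Affect'), (2013, 'Affect')], names=['YEAR', 'TERM'])
year_props = pd.DataFrame({'Number of Results': [0.10, 0.12]}, index=idx)

# reset_index() turns the YEAR and TERM index levels back into columns,
# replacing the for-loop approach.
year_props = year_props.reset_index()

print(list(year_props.columns))  # ['YEAR', 'TERM', 'Number of Results']
```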

And now we rename the new variable for proportion, and then merge our new dataframe with our old dataframe, joining them on the columns for Year and Term.


In [12]:
year_props.rename(columns={'Number of Results': 'Proportion'}, inplace=True)
df_2 = pd.merge(df, year_props, on=['YEAR', 'TERM'])

Because I love speaking in the grammar of graphics, we use Python's ggplot. I tried doing this with matplotlib for a while, but just couldn't get the plot that I wanted without what seemed like way too many lines of code. Also, we're not using the 'Total #' term (filtered using df_2[df_2.TERM != 'Total #']), since we're plotting proportions of the total, which means that term is just a flat line at 1.


In [13]:
p = ggplot(df_2[df_2.TERM != 'Total #'], aes('YEAR', 'Proportion'))
p + geom_line() + facet_wrap('TERM') + theme_bw()


Out[13]:
<ggplot: (292200233)>
See anything weird? Let's zoom in a bit, focusing only on search results from 1990 to 2000:

In [22]:
df_recent_papers = df_2[(df_2.YEAR >= 1990) & (df_2.YEAR <= 2000)]
p = ggplot(df_recent_papers[df_recent_papers.TERM != 'Total #'], aes('YEAR', 'Proportion'))
p + geom_line() + facet_wrap('TERM') + theme_bw()


Out[22]:
<ggplot: (295661981)>

In [128]:
p = ggplot(df_recent_papers[df_recent_papers.TERM == 'Total #'], aes('YEAR', 'Number of Results'))
p + geom_line()


Out[128]:
   YEAR  Number of Results               TERM  Proportion
0  2013                374  Honest OR Honesty    0.006498
1  2012                358  Honest OR Honesty    0.006510
2  2011                302  Honest OR Honesty    0.005622
3  2010                208  Honest OR Honesty    0.004239
4  2009                225  Honest OR Honesty    0.005066